
added handling of pid error response #686

Merged
merged 13 commits into main from hwp_pid_bugfix on Aug 8, 2024

Conversation

bbixler500
Contributor

Attempts to fix HWP PID error handling issues

Description

Occasionally the HWP PID controller will read an error value which it does not know how to handle, which crashes the monitoring process. This PR adds handling for this specific error code.

Motivation and Context

Issue described here (link)

How Has This Been Tested?

Briefly ran the changes on satp1 hardware. The error rarely occurs, so I was unable to verify correct handling explicitly.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@BrianJKoopman
Member

Added some reviewers. I expect we're all busy preparing for SPIE, but hopefully one of the three of us can review this soon.

@jlashner
Collaborator

Hi @bbixler500, thanks for looking into this. I like your description of the problem here: https://github.com/simonsobs/chwp-discussions/discussions/20#discussioncomment-9610840. Can we add a description like this to the docs of this agent, and maybe a short inline comment in addition?

A few requests:

  • Can we make this error handling a bit more general? I'd rather the decode function check for the ? char indicating an error, and log a warning with the corresponding message if we recognize the code, or "unrecognized error" if we don't (a sketch of the idea follows this list).
  • I'm really not a fan of returning 9.999 on failure and checking for it in calling functions. Is it too late to change this to return a "Result" class that contains fields like success, value, and error_msg? Or alternatively, can we just throw an error?
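
A minimal sketch of that decode-side check, assuming a lookup table of known error codes (the function name, logger, and empty table here are illustrative, not the agent's actual code):

import logging

log = logging.getLogger(__name__)

# Hypothetical map of known controller error strings to descriptions.
KNOWN_ERRORS = {}

def check_for_error(resp: str) -> bool:
    """Return True and log a warning if resp is a '?'-prefixed error."""
    if resp.startswith('?'):
        description = KNOWN_ERRORS.get(resp, 'unrecognized error')
        log.warning('PID controller returned %s: %s', resp, description)
        return True
    return False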

@ykyohei
Contributor

ykyohei commented Jul 29, 2024

Addressed the comments from Jack and tested this in daq-dev for satp3.
@bbixler500 @jlashner Could you review it? I believe the build error is a separate issue.

@jlashner
Collaborator

jlashner commented Jul 30, 2024

This looks fine, but I have a couple of gripes, some of which may be out of scope for this PR...

Thank you for adding better error checking, but I don't see any place where these messages are printed on failure. This should probably happen in the get_freq function.

Currently I really dislike the structure of the _decode_array function. The "decoded responses" list that it returns can contain a mix of error messages (strings), write confirmations (strings), decoded direction and frequency values (float or int), or just the raw message (string). The problem is that there is no way to check what is being returned beyond inspecting the value's type or parsing the string.

This would benefit a lot from creating a DecodedResponse class or dataclass, which includes fields like message_type, message, measure, success, or whatever else could be useful for parsing the information downstream. The _decode_array function would then return just a list of DecodedResponse objects.
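
A minimal sketch of such a class (the enum values and field names here are illustrative, not necessarily what ends up merged):

from dataclasses import dataclass
from enum import Enum

class MessageType(Enum):
    FreqMeasure = 'measure'
    Read = 'read'
    Write = 'write'
    Error = 'error'

@dataclass
class DecodedResponse:
    message_type: MessageType  # what kind of reply this is
    message: str               # human-readable message or error text
    measure: float             # decoded numeric value, when applicable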

Then you could do better error handling in the get_freq function, for example:

responses = []
responses.append(self.send_message("*X01"))
decoded_resp = self.return_messages(responses)[0]
if decoded_resp.message_type == MessageType.FreqMeasure:
    return decoded_resp.measure
elif decoded_resp.message_type == MessageType.Error:
    print(f"Error reading freq: {decoded_resp.message}")
    # Retry or raise exception
...

Maybe we save this for a future upgrade, but I think this would make error handling easier, and make the module more human-understandable.

@ykyohei
Contributor

ykyohei commented Jul 30, 2024

@jlashner Addressed your suggestions and tested on daq-dev again.
I thought the success field was redundant since msg_type = 'error' already indicates failure, so I didn't include it.

@jlashner
Collaborator

Thanks for refactoring so quickly! One recommendation after reading the new code: make msg and measure optional types that default to None, and then instantiate the class using keyword args instead of passing None everywhere (see the sketch below).
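
Something like this, for illustration (assuming the dataclass layout sketched above; not necessarily the exact merged code):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DecodedResponse:
    msg_type: str
    msg: Optional[str] = None
    measure: Optional[float] = None

# Keyword args instead of positional None-passing:
err = DecodedResponse(msg_type='error', msg='unrecognized error')
freq = DecodedResponse(msg_type='measure', msg='Current frequency = 2.0', measure=2.0)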

@BrianJKoopman
Member

Merging the latest from main should fix the builds here now.

jlashner previously approved these changes Jul 31, 2024
@ykyohei
Contributor

ykyohei commented Aug 1, 2024

@BrianJKoopman Do you have any idea why the integration test for test_cryomech_cpa_agent_integration.py takes more than 6 hours and fails?

@BrianJKoopman
Member

BrianJKoopman commented Aug 1, 2024

@BrianJKoopman Do you have any idea why the integration test for test_cryomech_cpa_agent_integration.py takes more than 6 hours and fails?

Hi @ykyohei. The printout in the CI is a bit misleading; sometimes the next line doesn't print where it's really hanging, so it looks like it's hanging on the cryomech tests, but it's really hanging on the HWP PID integration tests. Running locally I see:

agents/test_hwp_supervisor_agent.py ..                                                                                                                  [  7%]
agents/test_ls372_agent.py ..................................                                                                                           [ 28%]
agents/test_ocs_plugin_so.py .                                                                                                                          [ 29%]
agents/test_pysmurf_controller_agent.py ...........                                                                                                     [ 35%]
agents/test_smurf_file_emulator.py ........                                                                                                             [ 40%]
agents/test_suprsync_agent.py ...                                                                                                                       [ 42%]
integration/test_cryomech_cpa_agent_integration.py ......                                                                                               [ 46%]
integration/test_hwp_pid_agent_integration.py .

The error that I'm seeing the agent crash on is:

2024-08-01T14-58-19.348429 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ observatory.hwp-pid
2024-08-01T14-58-19.348514 log_file is apparently ./logs//observatory.hwp-pid.log
2024-08-01T14-58-19.348555 --mode agrument is deprecated.
2024-08-01T14-58-19.348786 Using OCS version 0.11.1
2024-08-01T14-58-19.348869 Cannot use relative log_dir without explicit working_dir.
2024-08-01T14-58-19.348905 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ observatory.hwp-pid
2024-08-01T14-58-19.348936 log_file is apparently ./logs//observatory.hwp-pid.log
2024-08-01T14-58-19.348965 --mode agrument is deprecated.
2024-08-01T14-58-19.358661 transport connected
2024-08-01T14-58-19.361244 session joined: {'authextra': {'x_cb_node': '08a8d29cdec9-1',
               'x_cb_peer': 'tcp4:172.28.0.1:60564',
               'x_cb_pid': 22,
               'x_cb_worker': 'worker001'},
 'authid': 'FLKT-K75P-YTHY-EC35-QWFL-QRUG',
 'authmethod': 'anonymous',
 'authprovider': 'static',
 'authrole': 'iocs_agent',
 'realm': 'test_realm',
 'resumable': False,
 'resume_token': None,
 'resumed': False,
 'serializer': 'cbor.batched',
 'session': 4894936459560107,
 'transport': {'channel_framing': 'websocket',
               'channel_id': {},
               'channel_serializer': None,
               'channel_type': 'tcp',
               'http_cbtid': None,
               'http_headers_received': None,
               'http_headers_sent': None,
               'is_secure': False,
               'is_server': False,
               'own': None,
               'own_fd': -1,
               'own_pid': 222346,
               'own_tid': 222346,
               'peer': 'tcp4:127.0.0.1:18001',
               'peer_cert': None,
               'websocket_extensions_in_use': None,
               'websocket_protocol': None}}
2024-08-01T14-58-19.364332 startup-op: launching main
2024-08-01T14-58-19.364454 start called for main
2024-08-01T14-58-19.364534 main:0 Status is now "starting".
2024-08-01T14-58-19.365759 main:0 Status is now "running".
2024-08-01T14-58-19.366198 Forward
2024-08-01T14-58-19.867096 asdfl
2024-08-01T14-58-19.867616 Connected to PID controller
2024-08-01T14-58-19.869391 Finding CHWP Frequency
2024-08-01T14-58-20.033228 start called for get_state
2024-08-01T14-58-20.033366 get_state:1 Status is now "starting".
2024-08-01T14-58-20.033700 get_state:1 Status is now "running".
2024-08-01T14-58-20.370454 ['X010.000']
2024-08-01T14-58-20.371015 DecodedResponse(msg_type='measure', msg='Current frequency = 0.0', measure=0.0)
2024-08-01T14-58-20.371402 Finding target CHWP Frequency
2024-08-01T14-58-20.872546 ['R010000']
2024-08-01T14-58-20.873370 DecodedResponse(msg_type='read', msg='Setpoint = 0.0', measure=0.0)
2024-08-01T14-58-20.873980 Finding CHWP Direction
2024-08-01T14-58-21.375178 ['0']
2024-08-01T14-58-21.375685 0
2024-08-01T14-58-21.377020 main:0 CRASH: [Failure instance: Traceback: <class 'AttributeError'>: 'str' object has no attribute 'msg_type'
/home/koopman/.pyenv/versions/3.11.9/lib/python3.11/threading.py:1045:_bootstrap_inner
/home/koopman/.pyenv/versions/3.11.9/lib/python3.11/threading.py:982:run
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/_threads/_threadworker.py:49:work
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/_threads/_team.py:192:doWork
--- <exception caught here> ---
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/threadpool.py:269:inContext
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/threadpool.py:285:<lambda>
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/context.py:117:callWithContext
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/twisted/python/context.py:82:callWithContext
/home/koopman/git/socs/.venv/lib/python3.11/site-packages/ocs/ocs_agent.py:984:_running_wrapper
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:174:main
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:121:_get_data_and_publish
/home/koopman/git/socs/socs/agents/hwp_pid/agent.py:35:get_pid_state
/home/koopman/git/socs/socs/agents/hwp_pid/drivers/pid_controller.py:307:get_direction
]
2024-08-01T14-58-21.377408 main:0 Status is now "done".

EDIT: It is a problem that this doesn't throw an actual error within the testing setup and move on to the next test. I'll try to work on a solution to that. But hopefully this will allow you to fix the error in the meantime.
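
For context, the traceback shows the direction reply still coming back as a raw string rather than a DecodedResponse, so downstream code that reads .msg_type crashes; a minimal illustration of the failure mode (not the actual agent code):

raw = '0'  # direction reply decoded as a bare string, not a DecodedResponse
try:
    raw.msg_type
except AttributeError as e:
    print(e)  # 'str' object has no attribute 'msg_type'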

@ykyohei
Contributor

ykyohei commented Aug 2, 2024

Test is fixed and all checks have passed!

@BrianJKoopman BrianJKoopman requested a review from jlashner August 2, 2024 21:40
@BrianJKoopman BrianJKoopman dismissed jlashner’s stale review August 2, 2024 21:40

Changes for fixing the tests should be reviewed.

@jlashner
Collaborator

Great, thanks Kyohei!

@BrianJKoopman BrianJKoopman merged commit ec0831a into main Aug 8, 2024
5 checks passed
@BrianJKoopman BrianJKoopman deleted the hwp_pid_bugfix branch August 8, 2024 14:24
@bbixler500 bbixler500 mentioned this pull request Aug 19, 2024